Aligning, Annotating and Lemmatizing a Corpus for the Validation of Balkan Wordnets
Abstract
In this paper we discuss the use of corpora in the validation of WordNets, and we present how the Greek version of George Orwell's Nineteen Eighty-Four was exploited for the construction and validation of the Greek WordNet, which is currently under development in the framework of the BalkaNet project. In particular, we focus on describing the tools that were developed and used for the alignment, annotation and lemmatization of the corpus.
Similar resources
Facilitating Multi-Lingual Sense Annotation: Human Mediated Lemmatizer
Sense-marked corpora are essential for supervised word sense disambiguation (WSD). The marked sense ids come from wordnets. However, words in corpora appear in morphed forms, while wordnets store lemmas. This situation calls for accurate lemmatizers. The lemma is the gateway to the wordnet. However, the problem is that for many languages, lemmatizers do not exist, and this problem is not easy to ...
Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets
The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents and being supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by Euro...
Word Sense Disambiguation as a Wordnets' Validation Method in Balkanet
BalkaNet is a European project which aims at the development of monolingual wordnets for five languages in the Balkans area (Bulgarian, Greek, Romanian, Serbian, and Turkish) and at the improvement of the Czech wordnet developed in the EuroWordNet project. The wordnets are aligned to the Princeton Wordnet, according to the principles established by the EuroWordNet consortium. One of the main concerns...
The Cross-Breeding of Dictionaries
Especially for English, the number of hand-coded electronic resources available to the Natural Language Processing Community keeps growing: annotated corpora, treebanks, lexicons, wordnets, etc. Unfortunately, initial funding for such projects is much easier to obtain than the additional funding needed to enlarge or improve upon such resources. Thus once one proves the usefulness of a resource,...
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
Publication date: 2003